Unifying Event Detection and Captioning as Sequence Generation via Pre-training
Authors
Abstract
Dense video captioning aims to generate corresponding text descriptions for a series of events in an untrimmed video, and can be divided into two sub-tasks: event detection and event captioning. Unlike previous works that tackle the two sub-tasks separately, recent works have focused on enhancing the inter-task association between them. However, designing inter-task interactions is not trivial due to the large differences in their task-specific solutions. Besides, previous methods normally ignore the temporal dependencies between events, leading to event redundancy or inconsistency problems. To tackle the above defects, in this paper we define event detection as a sequence generation task and propose a unified pre-training and fine-tuning framework to naturally enhance the inter-task association between event detection and captioning. Since the model predicts each event with previous events as its context, the inter-dependency between events is fully exploited, and thus our model can detect more diverse and consistent events in the video. Experiments on the ActivityNet dataset show that our model outperforms state-of-the-art methods, and can be further boosted when pre-trained on extra large-scale video-text data. Code is available at https://github.com/QiQAng/UEDVC .
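To make the formulation concrete, here is a minimal plain-Python sketch (not the authors' released UEDVC code) of how event boundaries could be quantized into discrete time-bin tokens and laid out in temporal order, so that an autoregressive decoder predicts each event conditioned on the events it has already generated; the bin count and special-token names are illustrative assumptions.

```python
# A minimal sketch (not the authors' code) of casting event detection as
# sequence generation: event boundaries are quantized into discrete time-bin
# tokens and ordered by start time, so an autoregressive decoder can predict
# each event conditioned on previously generated events.

NUM_BINS = 100          # assumed number of bins used to quantize the video duration
BOS, EOS = "<bos>", "<eos>"

def events_to_sequence(events, duration, num_bins=NUM_BINS):
    """Convert [(start_sec, end_sec), ...] into a flat token sequence.

    Events are sorted by start time so that each generated event serves as
    context for the next one, which helps the decoder avoid redundant or
    temporally inconsistent proposals.
    """
    tokens = [BOS]
    for start, end in sorted(events):
        s_bin = min(int(start / duration * num_bins), num_bins - 1)
        e_bin = min(int(end / duration * num_bins), num_bins - 1)
        tokens += [f"<t{s_bin}>", f"<t{e_bin}>"]
    tokens.append(EOS)
    return tokens

def sequence_to_events(tokens, duration, num_bins=NUM_BINS):
    """Inverse mapping: decode time-bin tokens back to (start, end) seconds."""
    bins = [int(t[2:-1]) for t in tokens if t.startswith("<t")]
    return [(s * duration / num_bins, e * duration / num_bins)
            for s, e in zip(bins[0::2], bins[1::2])]

if __name__ == "__main__":
    events = [(3.2, 10.5), (12.0, 25.7)]
    seq = events_to_sequence(events, duration=30.0)
    print(seq)   # ['<bos>', '<t10>', '<t35>', '<t40>', '<t85>', '<eos>']
    print(sequence_to_events(seq, duration=30.0))
```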
Similar resources
Consensus-based Sequence Training for Video Captioning
Captioning models are typically trained using the cross-entropy loss. However, their performance is evaluated on other metrics designed to better correlate with human assessments. Recently, it has been shown that reinforcement learning (RL) can directly optimize these metrics in tasks such as captioning. However, this is computationally costly and requires specifying a baseline reward at each st...
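As a rough illustration of the reward-with-baseline training this snippet refers to, below is a minimal PyTorch sketch of a policy-gradient caption loss in the self-critical style; it is not the paper's consensus-based method, and the tensor shapes and the toy inputs are illustrative assumptions.

```python
# A minimal policy-gradient-with-baseline sketch: sampled captions are scored
# by an evaluation metric (e.g. CIDEr) and a baseline reward is subtracted to
# reduce gradient variance. Shapes and inputs are assumptions for illustration.

import torch

def policy_gradient_loss(sample_logprobs, sample_reward, baseline_reward):
    """sample_logprobs: (batch, seq_len) log-probabilities of the sampled caption tokens.
    sample_reward / baseline_reward: (batch,) metric scores of the sampled captions
    and of a baseline (e.g. greedy-decoded) caption."""
    advantage = (sample_reward - baseline_reward).detach()          # (batch,)
    # Weight the summed caption log-probability by the advantage, average over the batch.
    return -(advantage.unsqueeze(1) * sample_logprobs).sum(dim=1).mean()

if __name__ == "__main__":
    torch.manual_seed(0)
    logits = torch.randn(2, 8, 100, requires_grad=True)             # toy decoder outputs
    logp = torch.log_softmax(logits, dim=-1).max(dim=-1).values     # log-probs of chosen tokens
    loss = policy_gradient_loss(logp, torch.rand(2), torch.rand(2))
    loss.backward()                                                  # gradients flow into the logits
    print(float(loss))
```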
Actor-Critic Sequence Training for Image Captioning
Generating natural language descriptions of images is an important capability for a robot or other visual-intelligence driven AI agent that may need to communicate with human users about what it is seeing. Such image captioning methods are typically trained by maximising the likelihood of the ground-truth annotated caption given the image. While simple and easy to implement, this approach does not ...
Scale Up Event Extraction Learning via Automatic Training Data Generation
The task of event extraction has long been investigated in a supervised learning paradigm, which is bound by the number and the quality of the training instances. Existing training data must be manually generated through a combination of expert domain knowledge and extensive human involvement. However, due to the drastic effort required in annotating text, the resultant datasets are usually small,...
Sequence to Sequence Model for Video Captioning
Automatically generating video captions with natural language remains a challenge for both the fields of natural language processing and computer vision. Recurrent Neural Networks (RNNs), which model sequence dynamics, have proved to be effective in visual interpretation. Based on a recent sequence to sequence model for video captioning, which is designed to learn the temporal structure of the se...
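For reference, here is a minimal PyTorch encoder-decoder sketch of the kind of sequence-to-sequence video captioner described above; it is not that paper's implementation, and the feature dimension, hidden size, and vocabulary size are illustrative assumptions.

```python
# A minimal seq2seq captioning sketch: an LSTM encodes frame features and a
# second LSTM decodes word tokens, initialized from the encoder's final state.
# All dimensions below are illustrative assumptions.

import torch
import torch.nn as nn

class Seq2SeqCaptioner(nn.Module):
    def __init__(self, feat_dim=2048, hidden=512, vocab_size=10000, embed=300):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.embed = nn.Embedding(vocab_size, embed)
        self.decoder = nn.LSTM(embed, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, frame_feats, caption_tokens):
        # Encode the frame sequence; its final state initializes the decoder.
        _, state = self.encoder(frame_feats)
        dec_out, _ = self.decoder(self.embed(caption_tokens), state)
        return self.out(dec_out)                    # (batch, caption_len, vocab_size) logits

if __name__ == "__main__":
    model = Seq2SeqCaptioner()
    feats = torch.randn(2, 40, 2048)                # 40 sampled frames per video
    tokens = torch.randint(0, 10000, (2, 15))       # teacher-forced caption tokens
    print(model(feats, tokens).shape)               # torch.Size([2, 15, 10000])
```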
Test generation using event sequence graphs
An Event Sequence Graph (ESG) is a simple albeit powerful formalism for capturing the behavior of a variety of interactive systems that include real-time, embedded systems, and graphical user interfaces. A collection of ESGs is proposed as a model of an interactive system. This collection is used for the generation of tests to check for the correctness of system behavior in the presence of expe...
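To illustrate how tests can be derived from an ESG, here is a small Python sketch that covers every edge of a toy event graph with complete entry-to-exit event sequences; the greedy edge-covering strategy and the example graph are assumptions for illustration, not the paper's algorithm.

```python
# A minimal sketch of ESG-based test generation: the graph maps each event to
# its follow-up events, and each generated test is an event sequence from the
# entry pseudo-event '[' to the exit ']' covering at least one uncovered edge.

from collections import deque

def shortest_path(graph, src, dst):
    """Breadth-first search for one shortest event sequence from src to dst."""
    queue, seen = deque([[src]]), {src}
    while queue:
        path = queue.popleft()
        if path[-1] == dst:
            return path
        for nxt in graph.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

def generate_tests(graph, entry="[", exit_="]"):
    """Cover every edge of the ESG with complete entry-to-exit event sequences."""
    uncovered = {(u, v) for u, vs in graph.items() for v in vs}
    tests = []
    while uncovered:
        u, v = next(iter(uncovered))
        test = shortest_path(graph, entry, u) + shortest_path(graph, v, exit_)
        tests.append(test)                           # covers (u, v) plus edges along the way
        uncovered -= set(zip(test, test[1:]))
    return tests

if __name__ == "__main__":
    # Toy ESG for a login dialog; edges follow the arrows of the graph.
    esg = {"[": ["open"], "open": ["type_user"], "type_user": ["type_pass"],
           "type_pass": ["submit", "cancel"], "submit": ["]"], "cancel": ["]"]}
    for t in generate_tests(esg):
        print(" -> ".join(t))
```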
Journal
Journal title: Lecture Notes in Computer Science
Year: 2022
ISSN: 1611-3349, 0302-9743
DOI: https://doi.org/10.1007/978-3-031-20059-5_21